10 research outputs found
Recommended from our members
Seedability: optimizing alignment parameters for sensitive sequence comparison
Data availability:
The data underlying this article are available either in https://github.com/lorrainea/Seedability or in the ensembl database at https://www.ensembl.org, and can be accessed using the gene names ENSPTRG00000044036 and ENSG00000174236 or in the NCBI database at https://www.ncbi.nlm.nih.gov and can be found using the reference sequence NC_000001.11.Motivation:
Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2â , use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedabilityâ , a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences.
Results:
The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments.
Availability and implementation:
https://github.com/lorrainea/Seedability (distributed under GPL v3.0).R.C. was supported by ANR Full-RNA, SeqDigger, Inception, and PRAIRIE grants (ANR-22-CE45-0007, ANR-19-CE45-0008, PIA/ANR16-CONV-0005, ANR-19-P3IA-0001). This project has received funding from the European Unionâs Horizon 2020 research and innovation programme under the Marie SkĆodowska-Curie grant agreements No. 872539 (PANGAIA) and 956229 (ALPACA)
String Sanitization: A Combinatorial Approach
String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a userâs loc
Fast Indexes for Gapped Pattern Matching
We describe indexes for searching large data sets for variable-length-gapped
(VLG) patterns. VLG patterns are composed of two or more subpatterns, between
each adjacent pair of which is a gap-constraint specifying upper and lower
bounds on the distance allowed between subpatterns. VLG patterns have numerous
applications in computational biology (motif search), information retrieval
(e.g., for language models, snippet generation, machine translation) and
capture a useful subclass of the regular expressions commonly used in practice
for searching source code. Our best approach provides search speeds several
times faster than prior art across a broad range of patterns and texts.Comment: This research is supported by Academy of Finland through grant 319454
and has received funding from the European Union's Horizon 2020 research and
innovation programme under the Marie Sklodowska-Curie Actions
H2020-MSCA-RISE-2015 BIRDS GA No. 69094
Subframe Temporal Alignment of Non-Stationary Cameras
This paper studies the problem of estimating the sub-frame temporal off-set between unsychronized, non-stationary cameras. Based on motion trajec-tory correspondences, the estimation is done in two steps. First, we propose an algorithm to robustly estimate the frame accurate offset by analyzing the trajectories and matching their characteristic time patterns. Using this result, we then show how the estimation of the fundamental matrix between two cameras can be reformulated to yield the sub-frame accurate offset from nine correspondences. We verify the robustness and performance of our approach on synthetic data as well as on real video sequences.
String sanitization: a combinatorial approach
String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a userâs location history). In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility. First, we propose a time-optimal algorithm, TFS-ALGO, to construct
The Amborella genome and the evolution of flowering plants.
Amborella trichopoda is strongly supported as the single living species of the sister lineage to all other extant flowering plants, providing a unique reference for inferring the genome content and structure of the most recent common ancestor (MRCA) of living angiosperms. Sequencing the Amborella genome, we identified an ancient genome duplication predating angiosperm diversification, without evidence of subsequent, lineage-specific genome duplications. Comparisons between Amborella and other angiosperms facilitated reconstruction of the ancestral angiosperm gene content and gene order in the MRCA of core eudicots. We identify new gene families, gene duplications, and floral protein-protein interactions that first appeared in the ancestral angiosperm. Transposable elements in Amborella are ancient and highly divergent, with no recent transposon radiations. Population genomic analysis across Amborella's native range in New Caledonia reveals a recent genetic bottleneck and geographic structure with conservation implications